Bringing open source software practices to the scholarly publishing community for authors, reviewers, editors, and publishers
Published
July 20, 2024
Introduction
This report contains the code for generating figures and numbers presented in
Diehl et al (2024): The Journal of Open Source Software (JOSS) - Bringing open source software practices to the scholarly publishing community for authors, reviewers, editors, and publishers
Load packages
We first load the R packages that are needed for the analysis and plotting.
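The package-loading chunk itself is not reproduced in this rendering. As a rough sketch, the list below is inferred only from the functions called later in this report (for example `theme_cowplot()`, `geom_density_ridges()` and `kableExtra::kbl()`); the actual list used for the paper may differ. The check relies on `requireNamespace()` so that it runs even when some packages are not installed.

```r
## Assumed package list, inferred from the calls used later in this report;
## this is not the original setup chunk
pkgs <- c("ggplot2", "cowplot", "ggridges", "dplyr", "tidyr",
          "forcats", "kableExtra", "openalexR")

## Check which of the assumed packages are available, without attaching them
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
    message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```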
The figure below shows the number of papers published each month. The overlaid curve represents a loess fit to the monthly data, generated using the ggplot2 package.
Code
(gg_nbrpapers <- ggplot(summarydf_month,
                        aes(x = factor(pubmonth), y = npub)) +
    geom_col(aes(fill = as.character(pubyear))) +
    geom_smooth(aes(x = as.numeric(factor(pubmonth))), method = "loess",
                se = FALSE, color = "black") +
    labs(x = "Publication month",
         y = "Number of published\npapers per month") +
    scale_y_continuous(expand = c(0, 0)) +
    scale_fill_manual(name = "Year", values = yearcols) +
    guides(fill = guide_legend(nrow = 1, byrow = TRUE,
                               title = "Publication year")) +
    theme_cowplot() +
    theme(axis.title = element_text(size = 15),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          legend.position = "bottom",
          legend.justification = "right",
          legend.margin = margin(1, 10, 1, 1)))
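The construction of `summarydf_month` is not shown in this section. The following is a minimal sketch of how such a monthly table could be built and smoothed, using simulated publication dates and base R only. The dates are made up, and the `idx` and `smoothed` columns are illustrative names that do not appear in the original analysis.

```r
## Simulated publication dates (not JOSS data)
set.seed(1)
published <- as.Date("2016-05-01") + sample(0:364, 200, replace = TRUE)

## Count papers per calendar month
pubmonth <- format(published, "%Y-%m")
summarydf_month <- as.data.frame(table(pubmonth), responseName = "npub",
                                 stringsAsFactors = FALSE)

## Fit a loess curve to the monthly counts, using the month index as x,
## analogous to as.numeric(factor(pubmonth)) in the plotting code
summarydf_month$idx <- seq_len(nrow(summarydf_month))
fit <- loess(npub ~ idx, data = summarydf_month)
summarydf_month$smoothed <- predict(fit)
```

Converting the month to a numeric index is what allows a smoother to be fitted at all, since the bars themselves are drawn on a discrete axis.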
Number of editors per year
The next figure shows the number of editors who accepted at least one paper in a given year, as well as the total number of editors who have accepted at least one paper overall. Note that the data for 2024 only include papers published up to 2024-07-17.
Code
(gg_nbreditors <- ggplot(summarydf_year,
                         aes(x = pubyear, y = nbr_editors, fill = pubyear)) +
    scale_fill_manual(name = "Year", values = yearcols) +
    geom_col() +
    annotate(geom = "text", x = 1,
             y = 0.9 * max(summarydf_year$nbr_editors),
             label = paste0("Total number\nof editors: ", tot_nbr_editors),
             hjust = 0, vjust = 1, size = 5) +
    scale_y_continuous(expand = c(0, 0)) +
    labs(x = "Publication year", y = "Number of editors") +
    theme_cowplot() +
    theme(legend.position = "none"))
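The per-year counts and the overall total are computed upstream of this section. A minimal sketch of the distinction between the two, on a toy table with made-up editor handles (the `papers_toy` object is hypothetical):

```r
## Toy data: one row per published paper (editor handles are made up)
papers_toy <- data.frame(
    pubyear = c(2016, 2016, 2017, 2017, 2017, 2018),
    editor  = c("@ed_a", "@ed_a", "@ed_a", "@ed_b", "@ed_c", "@ed_b")
)

## Number of distinct editors accepting at least one paper in each year
nbr_editors <- tapply(papers_toy$editor, papers_toy$pubyear,
                      function(x) length(unique(x)))

## Total number of distinct editors across all years
tot_nbr_editors <- length(unique(papers_toy$editor))
```

Because the same editor is typically active in several years, the yearly counts do not add up to the overall total, which is why the total is shown as a separate annotation in the figure.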
Number of new editors per year
We next illustrate the number of editors who accepted their first paper in a given year, as well as the fraction they represent of all editors accepting a paper that year. Note that the data for 2024 only include papers published up to 2024-07-17.
Code
ggplot(summarydf_year,
       aes(x = pubyear, y = nbr_new_editors, fill = pubyear)) +
    scale_fill_manual(name = "Year", values = yearcols) +
    geom_col() +
    geom_text(aes(label = frac_new_editors), vjust = -0.2) +
    scale_y_continuous(expand = expansion(mult = c(0, .1))) +
    labs(x = "",
         y = "Number of new editors and\npercentage of total number of editors") +
    theme_cowplot() +
    theme(legend.position = "none")
Code
## Combine - show both the total number of editors and the number of new ones
ggplot(summarydf_year,
       aes(x = pubyear, fill = pubyear)) +
    scale_fill_manual(name = "Year", values = yearcols) +
    geom_col(aes(y = nbr_editors), alpha = 0.25) +
    geom_col(aes(y = nbr_new_editors)) +
    geom_text(aes(y = nbr_new_editors, label = frac_new_editors),
              vjust = -0.2) +
    scale_y_continuous(expand = c(0, 0)) +
    labs(x = "",
         y = "Number of new editors and\npercentage of total number of editors") +
    theme_cowplot() +
    theme(legend.position = "none")
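How `nbr_new_editors` and `frac_new_editors` are derived is not shown here. One possible way to obtain them, sketched on toy data with made-up editor handles: find the first year in which each editor appears, count those first appearances per year, and divide by the number of active editors. The percentage label format is an assumption.

```r
## Toy data: one row per published paper (editor handles are made up)
papers_toy <- data.frame(
    pubyear = c(2016, 2016, 2017, 2017, 2017, 2018),
    editor  = c("@ed_a", "@ed_a", "@ed_a", "@ed_b", "@ed_c", "@ed_b")
)
years <- sort(unique(papers_toy$pubyear))

## Year in which each editor accepted their first paper
first_year <- tapply(papers_toy$pubyear, papers_toy$editor, min)

## Number of editors whose first paper falls in each year
nbr_new_editors <- as.vector(table(factor(first_year, levels = years)))

## Number of distinct editors active in each year
nbr_editors <- as.vector(tapply(papers_toy$editor, papers_toy$pubyear,
                                function(x) length(unique(x))))

## New editors as a percentage of that year's editors (assumed label format)
frac_new_editors <- paste0(round(100 * nbr_new_editors / nbr_editors), "%")
```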
Number of reviewers per year
As with the number of editors above, the figure below shows the number of reviewers who reviewed at least one paper in a given year, as well as the total number of reviews submitted that year. Again, the data for 2024 only include papers published up to 2024-07-17.
Code
(gg_nbrreviewers <- ggplot(summarydf_year,
                           aes(x = pubyear, y = nbr_reviewers)) +
    scale_fill_manual(name = "Year", values = yearcols) +
    geom_col(aes(fill = pubyear)) +
    geom_line(aes(x = as.numeric(pubyear), y = nbr_reviews),
              color = "grey", linewidth = 1.5) +
    geom_point(aes(x = as.numeric(pubyear), y = nbr_reviews),
               color = "grey", size = 2.5) +
    geom_col(aes(fill = pubyear)) +
    annotate(geom = "text", x = 1,
             y = 0.9 * max(summarydf_year$nbr_reviews),
             label = paste0("Total number\nof reviewers: ", tot_nbr_reviewers),
             hjust = 0, vjust = 1, size = 5) +
    scale_y_continuous(expand = expansion(mult = c(0, .05))) +
    labs(x = "Publication year", y = "Number of reviewers\nand reviews",
         caption = "Bars show number of unique reviewers,\ngrey line shows total number of reviews") +
    theme_cowplot() +
    theme(legend.position = "none"))
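The difference between the bars (unique reviewers) and the grey line (reviews) comes from reviewers handling more than one paper in a year. A minimal sketch on toy data, assuming that each paper stores its reviewers as a list of handles; the `papers_toy` and `reviews` objects and the handles are hypothetical:

```r
## Toy data: each paper has a vector of reviewer handles (made up)
papers_toy <- data.frame(pubyear = c(2019, 2019, 2020))
papers_toy$reviewers <- list(c("@rev_a", "@rev_b"),
                             c("@rev_a", "@rev_c"),
                             c("@rev_b", "@rev_d", "@rev_e"))

## Expand to one row per (paper, reviewer) pair
reviews <- data.frame(
    pubyear  = rep(papers_toy$pubyear, lengths(papers_toy$reviewers)),
    reviewer = unlist(papers_toy$reviewers)
)

## Total number of reviews per year
nbr_reviews <- table(reviews$pubyear)

## Number of distinct reviewers per year
nbr_reviewers <- tapply(reviews$reviewer, reviews$pubyear,
                        function(x) length(unique(x)))
```

In this toy example, 2019 has four reviews but only three distinct reviewers, because `@rev_a` reviewed two papers.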
Number of ‘new’ reviewers per year
We also plot the number of reviewers who reviewed their first paper in a given year, and calculate the fraction of that year's reviewers that they represent. As above, the data for 2024 only include papers published up to 2024-07-17.
Code
ggplot(summarydf_year,
       aes(x = pubyear, y = nbr_new_reviewers)) +
    scale_fill_manual(name = "Year", values = yearcols) +
    geom_col(aes(fill = pubyear)) +
    geom_text(aes(label = frac_new_reviewers), vjust = -0.2) +
    scale_y_continuous(expand = expansion(mult = c(0, .1))) +
    labs(x = "",
         y = "Number of first-time reviewers and\npercentage of total number of reviewers") +
    theme_cowplot() +
    theme(legend.position = "none")
Number of reviewers per submission
Here we illustrate the distribution of the number of reviewers assigned to each submission, over time. We exclude submissions that have already been reviewed at rOpenSci or pyOpenSci, since they are not re-reviewed at JOSS.
Code
## Exclude submissions already reviewed at rOpenSci or pyOpenSci
nrev <- papers |>
    dplyr::filter(!grepl("rOpenSci|pyOpenSci", prerev_labels)) |>
    dplyr::select(pubyear, title, nbr_reviewers, doi)

(gg_nrevpersub <- ggplot(
    nrev,
    aes(x = nbr_reviewers,
        fill = forcats::fct_relevel(pubyear, rev(levels(pubyear))))) +
    geom_bar() +
    scale_fill_manual(values = yearcols, name = "Year") +
    scale_y_continuous(expand = c(0, 0)) +
    labs(x = "Number of reviewers per submission",
         y = "Number of submissions",
         caption = "Submissions reviewed via rOpenSci/pyOpenSci are excluded") +
    theme_cowplot() +
    theme(legend.position = "none"))
Since 2020, all papers have been reviewed by at least two reviewers. The handful of exceptions are two addendum papers and three cases where the editor replaced one reviewer who dropped out during the process.
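A statement like this can be checked by listing the submissions from 2020 onwards that have fewer than two reviewers, after applying the same rOpenSci/pyOpenSci exclusion. The sketch below uses a toy table with made-up labels and counts, and a numeric `pubyear` for simplicity (in the analysis above it is a factor).

```r
## Toy data (made up); prerev_labels mimics comma-separated issue labels
papers_toy <- data.frame(
    pubyear       = c(2019, 2020, 2021, 2021),
    prerev_labels = c("R", "Python,rOpenSci", "Julia", "pyOpenSci,Python"),
    nbr_reviewers = c(1, 0, 2, 0)
)

## Drop submissions already reviewed at rOpenSci or pyOpenSci
nrev_toy <- papers_toy[!grepl("rOpenSci|pyOpenSci", papers_toy$prerev_labels), ]

## Remaining submissions from 2020 onwards with fewer than two reviewers
exceptions <- nrev_toy[nrev_toy$pubyear >= 2020 & nrev_toy$nbr_reviewers < 2, ]
nrow(exceptions)
```

Any rows left in `exceptions` would be the cases to inspect manually.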
Time in review
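In these plots we investigate how the time a submission spends in the pre-review or review stage (or their sum) has changed over time. The curve is a rolling median: for each pre-review opening date, it shows the median over all submissions whose pre-review opened within a 120-day window centered on that date.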
Code
## Helper functions (modified from
## https://stackoverflow.com/questions/65147186/geom-smooth-with-median-instead-of-mean)
rolling_median <- function(formula, data, xwindow = 120, ...) {
    ## Get order of x-values and sort x/y
    ordr <- order(data$x)
    x <- data$x[ordr]
    y <- data$y[ordr]

    ## Initialize vector for smoothed y-values
    ys <- rep(NA, length(x))

    ## Calculate median y-value for each unique x-value
    for (xs in setdiff(unique(x), NA)) {
        ## Get x-values in the window, and calculate median of corresponding y
        j <- ((xs - xwindow / 2) < x) & (x < (xs + xwindow / 2))
        ys[x == xs] <- median(y[j], na.rm = TRUE)
    }
    y <- ys
    structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}

predict.rollmed <- function(mod, newdata, ...) {
    setNames(mod$f(newdata$x), newdata$x)
}
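To make the window logic concrete, the sketch below applies the same rule (strictly within `xwindow / 2` on either side of each x-value) to a small made-up vector. The function `roll_med` is a simplified stand-in written for this illustration; it is not part of the original analysis and skips the sorting and interpolation steps.

```r
## Simplified stand-in for the window logic in rolling_median()
roll_med <- function(x, y, xwindow = 120) {
    vapply(x, function(xs) {
        ## Points strictly within xwindow / 2 on either side of xs
        in_window <- (xs - xwindow / 2) < x & x < (xs + xwindow / 2)
        median(y[in_window], na.rm = TRUE)
    }, numeric(1))
}

## Made-up example: x in days, y in arbitrary units
x <- c(0, 50, 100, 200, 400)
y <- c(10, 20, 30, 40, 50)
smoothed <- roll_med(x, y)
smoothed
## 15 20 25 40 50
```

The last two points have no neighbours within 60 days, so their smoothed value equals their own y-value. In sparse regions of the real data the curve therefore follows individual submissions closely.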
Code
data.frame(
    `Median number of days in pre-review` =
        round(median(papers$days_in_pre, na.rm = TRUE), 1),
    `Mean number of days in pre-review` =
        round(mean(papers$days_in_pre, na.rm = TRUE), 1),
    `Median number of days in review` =
        round(median(papers$days_in_rev, na.rm = TRUE), 1),
    `Mean number of days in review` =
        round(mean(papers$days_in_rev, na.rm = TRUE), 1),
    `Median number of days in pre-review + review` =
        round(median(papers$days_in_pre + papers$days_in_rev, na.rm = TRUE), 1),
    `Mean number of days in pre-review + review` =
        round(mean(papers$days_in_pre + papers$days_in_rev, na.rm = TRUE), 1),
    check.names = FALSE
) |>
    tidyr::pivot_longer(everything()) |>
    kableExtra::kbl(col.names = NULL) |>
    kableExtra::kable_styling()
(gg_timeinrev <- ggplot(papers,
                        aes(x = prerev_opened,
                            y = as.numeric(days_in_pre) + as.numeric(days_in_rev),
                            color = pubyear)) +
    geom_point() +
    annotate(geom = "text", x = as.Date("2016-10-01"), y = 950,
             label = textannot, hjust = 0) +
    geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE,
                method.args = list(xwindow = 120), color = "black") +
    scale_color_manual(values = yearcols, name = "Year") +
    labs(x = "Date of pre-review opening",
         y = "Number of days in\npre-review + review") +
    theme_cowplot() +
    theme(legend.position = "none"))
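The `days_in_pre` and `days_in_rev` columns are computed before this section and their derivation is not shown. A plausible sketch, assuming they are differences between the dates on which the pre-review issue was opened, the review issue was opened, and the paper was published. The `review_opened` and `published` column names and all dates are hypothetical.

```r
## Toy data with made-up dates; the last submission is still in pre-review
papers_toy <- data.frame(
    prerev_opened = as.Date(c("2023-01-01", "2023-02-10", "2023-03-05")),
    review_opened = as.Date(c("2023-01-15", "2023-03-01", NA)),
    published     = as.Date(c("2023-04-15", "2023-05-30", NA))
)

## Days spent in each stage
papers_toy$days_in_pre <- as.numeric(difftime(papers_toy$review_opened,
                                              papers_toy$prerev_opened,
                                              units = "days"))
papers_toy$days_in_rev <- as.numeric(difftime(papers_toy$published,
                                              papers_toy$review_opened,
                                              units = "days"))

## Total time, ignoring submissions with missing dates in the summary
total_days <- papers_toy$days_in_pre + papers_toy$days_in_rev
median_total <- median(total_days, na.rm = TRUE)
```

Submissions with a missing date yield `NA`, which is why the summary table above passes `na.rm = TRUE` throughout.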
Number of comments per review issue
Here, we count the number of comments made in the review GitHub issues for each submission. We remove comments made by the editorial bot (user name @whedon or @editorialbot). Note that issues opened in the software repositories themselves, or comments therein, are not counted.
ncomments <- readRDS("review_issue_nbr_comments.rds")
(gg_nbrcomments <- ggplot(
    papers |>
        left_join(ncomments, by = join_by(alternative.id, review_issue_id)),
    aes(x = nbr_comments_nobot, y = pubyear)) +
    geom_density_ridges(aes(fill = pubyear)) +
    scale_fill_manual(values = yearcols, name = "Year") +
    labs(x = "Number of comments (not by bot) in review issue", y = "") +
    theme_cowplot() +
    theme(legend.position = "none"))
Picking joint bandwidth of 4.33
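The file `review_issue_nbr_comments.rds` is generated elsewhere, so the counting step is not visible here. A minimal sketch of how bot comments could be excluded before counting, assuming one row per comment with the commenting user's name. The `comments_toy` table and the non-bot user names are made up.

```r
## Toy data: one row per comment in a review issue (made up)
comments_toy <- data.frame(
    review_issue_id = c(101, 101, 101, 102, 102),
    user = c("editorialbot", "reviewer1", "author1", "whedon", "reviewer2")
)

## Remove comments made by the editorial bot under either user name
bots <- c("whedon", "editorialbot")
human <- comments_toy[!comments_toy$user %in% bots, ]

## Count the remaining comments per review issue
## (note: table() turns review_issue_id into a factor)
ncomments_toy <- as.data.frame(
    table(review_issue_id = human$review_issue_id),
    responseName = "nbr_comments_nobot"
)
```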
Funding statement statistics
We performed a manual check of all papers published in JOSS in 2023 to extract information about whether any acknowledgement of funding was made. Here we summarize these numbers.
Finally, we calculate some statistics related to the citation of JOSS and SoftwareX papers. This information has been retrieved from OpenAlex, using the openalexR R package.
All papers
Code
## All papers
tmp <- papers$citation_count
cat("Number of papers: ", length(tmp), "\n")
cat("Number of citations: ", sum(tmp, na.rm = TRUE), "\n")
cat("Summary statistics: \n")
summary(tmp)
Number of papers: 2564
Number of citations: 67542
Summary statistics:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
    0.00     0.00     3.00    26.41     9.00 11020.00        7
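Note that `length(tmp)` counts all papers, including the 7 without a citation count, whereas `summary()` computes the mean over the non-missing values only. The reported mean can be cross-checked from the numbers printed above; the small vector at the end is a made-up example of the same behaviour.

```r
## Values taken from the output above
n_papers    <- 2564
n_missing   <- 7
n_citations <- 67542

## Mean over papers with a known citation count
mean_citations <- n_citations / (n_papers - n_missing)
round(mean_citations, 2)
## 26.41

## Same behaviour on a made-up vector with one missing value
toy <- c(0, 0, 3, 9, NA, 120)
toy_mean <- mean(toy, na.rm = TRUE)   # 132 / 5, not 132 / 6
```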
The most cited paper:
Code
maxcit <- which.max(papers$citation_count)
cat(paste0(papers$author[maxcit][[1]]$family[1], " et al (",
           papers$pubyear[maxcit], "): ", papers$title[maxcit],
           " (", papers$citation_count[maxcit], " citations)"))
Wickham et al (2019): Welcome to the Tidyverse (11020 citations)
Papers published in 2016-2023
Code
## Papers published in 2016-2023
tmp <- papers$citation_count[papers$pubyear %in%
                                 c(2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023)]
cat("Number of papers: ", length(tmp), "\n")
cat("Number of citations: ", sum(tmp, na.rm = TRUE), "\n")
cat("Summary statistics: \n")
summary(tmp)
Number of papers: 2264
Number of citations: 67458
Summary statistics:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     1.0     3.0    29.8    10.0 11020.0
Number of papers: 1154
Number of citations: 12058
Summary statistics:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    1.00    3.00   10.45   10.00  442.00
Put together final figures
Code
## Overwrite the get_legend function from cowplot temporarily,
## since the cowplot one doesn't work with ggplot2 3.5.0
## see https://github.com/wilkelab/cowplot/issues/202
get_legend <- function(plot, legend = NULL) {
    gt <- ggplotGrob(plot)
    pattern <- "guide-box"
    if (!is.null(legend)) {
        pattern <- paste0(pattern, "-", legend)
    }
    indices <- grep(pattern, gt$layout$name)
    not_empty <- !vapply(
        gt$grobs[indices],
        inherits, what = "zeroGrob",
        FUN.VALUE = logical(1)
    )
    indices <- indices[not_empty]
    if (length(indices) > 0) {
        return(gt$grobs[[indices[1]]])
    }
    return(NULL)
}
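The name matching inside this replacement function can be illustrated without building a plot. The sketch below uses a mock vector of layout names, assuming that ggplot2 3.5.0 and later creates one `guide-box-*` entry per legend position. The vector is hand-written for illustration and is not taken from a real gtable.

```r
## Mock layout names (assumed to resemble gt$layout$name in ggplot2 >= 3.5.0)
layout_names <- c("background", "panel",
                  "guide-box-right", "guide-box-left", "guide-box-bottom",
                  "guide-box-top", "guide-box-inside", "title")

## Without a 'legend' argument, every guide box matches
all_idx <- grep("guide-box", layout_names)

## With legend = "bottom", only the bottom guide box matches
bottom_idx <- grep(paste0("guide-box", "-", "bottom"), layout_names)
```

Since several entries can match, most of them typically empty placeholders, the function filters out `zeroGrob` objects and returns the first remaining legend.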